Time-Sensitive Bandit Learning and Satisficing Thompson Sampling
Authors
Abstract
The literature on bandit learning and regret analysis has focused on contexts where the goal is to converge on an optimal action in a manner that limits exploration costs. One shortcoming imposed by this orientation is that it does not treat time preference in a coherent manner. Time preference plays an important role when the optimal action is costly to learn relative to near-optimal actions. This limitation has not only restricted the relevance of theoretical results but has also influenced the design of algorithms. Indeed, popular approaches such as Thompson sampling and UCB can fare poorly in such situations. In this paper, we consider discounted rather than cumulative regret, where a discount factor encodes time preference. We propose satisficing Thompson sampling – a variation of Thompson sampling – and establish a strong discounted regret bound for this new algorithm.
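As a concrete illustration of the proposed approach, the following is a minimal sketch of satisficing Thompson sampling on a Bernoulli bandit with Beta posteriors. The tolerance epsilon, the lowest-index tie-breaking rule, and the discounted-regret computation are illustrative assumptions based on the abstract, not the paper's exact specification.

import numpy as np

def satisficing_ts(true_means, horizon, epsilon=0.05, discount=0.99, seed=0):
    """Sketch: satisficing Thompson sampling on a Bernoulli bandit.

    Rather than always chasing the sampled-optimal arm, play the
    lowest-indexed arm whose posterior sample is within epsilon of the
    sampled optimum, settling quickly for a near-optimal action.
    (Illustrative reading of the abstract, not the paper's exact rule.)
    """
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means)
    k = len(true_means)
    alpha, beta = np.ones(k), np.ones(k)        # Beta(1, 1) priors per arm
    actions = []
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)           # one posterior sample per arm
        a = int(np.flatnonzero(theta >= theta.max() - epsilon)[0])
        r = float(rng.random() < true_means[a]) # Bernoulli reward
        alpha[a] += r
        beta[a] += 1.0 - r
        actions.append(a)
    # Discounted regret: the discount factor weights early shortfalls
    # more heavily than late ones, encoding time preference.
    gaps = true_means.max() - true_means[actions]
    return np.sum(discount ** np.arange(horizon) * gaps)

With epsilon set to 0 the selection rule reduces to standard Thompson sampling; a larger epsilon lets the learner settle quickly on a near-optimal arm, which the discounted objective rewards.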
Similar resources
Satisficing in Time-Sensitive Bandit Learning
Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action. One shortcoming is that this orientation does not account for time sensitivity, which can play a crucial role when learning an optimal action requires much more information than near-optimal ones. Indeed, popular approaches such as upper-confidence-bound methods and Thompson samplin...
Bayesian bandits: balancing the exploration-exploitation tradeoff via double sampling
Reinforcement learning studies how to balance exploration and exploitation in real-world systems, optimizing interactions with the world while simultaneously learning how the world works. One general class of such learning problems is the multi-armed bandit setting (in which sequential interactions are independent and identically distributed) and the related contextual bandit case, in which...
Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling
Recent advances in deep reinforcement learning have made significant strides in performance on applications such as Go and Atari games. However, developing practical methods to balance exploration and exploitation in complex domains remains largely unsolved. Thompson Sampling and its extension to reinforcement learning provide an elegant approach to exploration that only requires access to post...
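The observation that Thompson sampling needs only posterior samples can be made concrete with a short sketch; here sample_posterior is a hypothetical callable standing in for whatever exact or approximate posterior (for example, a dropout pass through a deep network or a bootstrapped ensemble member) happens to be available.

import numpy as np

def thompson_step(sample_posterior):
    """One Thompson-sampling decision, abstracted over the posterior.

    sample_posterior is any callable returning one sampled expected
    reward per action; the decision rule itself never touches the
    posterior beyond drawing this single sample.
    """
    sampled_rewards = np.asarray(sample_posterior())
    return int(np.argmax(sampled_rewards))

# Example with a toy Gaussian posterior over three arms:
rng = np.random.default_rng(0)
arm = thompson_step(lambda: rng.normal([0.1, 0.5, 0.3], 0.2))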
Bootstrapped Thompson Sampling and Deep Exploration
This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions. The approach is based on a bootstrap technique that uses a combination of observed and artificially generated data. The latter serves to induce a prior distribution which, as we will demonstrate, is critic...
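To make the bootstrap idea concrete, here is a minimal sketch assuming Bernoulli-style rewards; the two artificial pseudo-rewards standing in for a prior, and all names, are assumptions for illustration rather than the note's exact construction.

import numpy as np

def bootstrapped_ts_step(histories, rng, prior_data=(0.0, 1.0)):
    """One decision of a bootstrapped Thompson-sampling-style rule.

    For each arm, resample with replacement from its observed rewards
    plus a few artificial pseudo-rewards; the artificial data induces
    a prior and keeps rarely played arms exploring, with no explicit
    posterior maintained or sampled.
    """
    scores = []
    for observed in histories:
        data = np.concatenate([np.asarray(observed, dtype=float), prior_data])
        boot = rng.choice(data, size=data.size, replace=True)  # bootstrap resample
        scores.append(boot.mean())
    return int(np.argmax(scores))

# Example: three arms, the last not yet played.
rng = np.random.default_rng(0)
arm = bootstrapped_ts_step([[1, 0, 1], [0, 0], []], rng)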
Analysis of Thompson Sampling for Stochastic Sleeping Bandits
We study a variant of the stochastic multi-armed bandit problem where the set of available arms varies arbitrarily with time (also known as the sleeping bandit problem). We focus on the Thompson Sampling algorithm and consider a regret notion defined with respect to the best available arm. Our main result is an O(log T) regret bound for Thompson Sampling, which generalizes a similar bound known ...
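A minimal sketch of Thompson sampling adapted to the sleeping setting, assuming Bernoulli arms with Beta posteriors; the function and variable names are illustrative rather than taken from the paper.

import numpy as np

def sleeping_ts_step(alpha, beta, available, rng):
    """Thompson sampling restricted to the arms awake this round.

    Sample from each available arm's Beta posterior and play the best
    sample; sleeping arms are simply excluded, matching a regret
    notion measured against the best *available* arm.
    """
    available = np.asarray(available)
    theta = rng.beta(alpha[available], beta[available])
    return int(available[np.argmax(theta)])

# Example: arms 0 and 2 are available this round.
rng = np.random.default_rng(0)
alpha, beta = np.ones(3), np.ones(3)   # Beta(1, 1) posteriors
arm = sleeping_ts_step(alpha, beta, [0, 2], rng)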
Journal: CoRR
Volume: abs/1704.09028
Issue: -
Pages: -
Publication date: 2017